481-W12-klpaters-InformationExtractionNLP
ABSTRACT: Natural language processing (NLP) can assist in tasks such as information extraction when the information you are trying to extract comes from a natural language text source. It does so by using linguistic knowledge to create structure for free text. At the most basic level, this involves syntactic tagging and parsing: each word is first tagged with a part of speech, and the tagged words are then combined into a parse tree that represents the relationships among the constituent parts of a sentence. Even once text has been parsed, we may not be able to extract the information until we "know" what it means. For example, how do we know which noun phrases refer to a person and which do not? A more sophisticated approach needs some semantic information, and this is where expert systems become a valuable tool for information extraction. By using rules to match and combine pieces of a sentence into "semantic" constituents, we can create meaning across a sentence and then extract the important pieces.

SLIDES: https://docs.google.com/presentation/d/1bjTRQUDXrP4o1ZHcSAypbr7Ug3En8Mcq78kuWqI78OM/edit

OUTLINE: Natural Language Processing and Information Extraction

Define terms
1. Natural Language Processing: the field of using computers to process natural language
2. Information Extraction: a type of information retrieval that strives to automatically extract information from structured or unstructured sources and turn it into structured data (Wikipedia/Information_extraction)

For example, if I have a collection of obituaries and want to know the date of birth of each person, how do I get this information out? This is a task for information extraction, helped along by NLP and the use of an expert system.

What's the connection between NLP and IE?
When your data is natural language, getting structure out of the data often requires linguistic understanding of the form and meaning of the language. Syntax and semantics are the primary tools that NLP gives us.

Let's go back to the obituary example. Say I have this obituary:

John Smith was born on January 1st, 1948.

As it stands, the computer does not understand that John Smith is a person, that he was born, that only living things can be born, that January 1st, 1948 is a date, or that it is the date on which John Smith was born. How do we get there?

Steps to information extraction:
1. POS tagging
2. Parsing
3. Semantic combinations
4. Information extraction

POS tagging
The first thing we need to do is break the example down into its most basic linguistic elements: parts of speech (POS). We have a few "bits" in our toolbox for tagging our example, because we know there are several types of words: nouns, verbs, adjectives, etc.

John/NNP Smith/NNP was/VBD born/VBN on/IN January/NNP 1st/CD ,/, 1948/CD ./.

For those of you who don't speak Stanford Parser: NNP is a proper noun, VBD a past-tense verb, VBN a past participle, IN a preposition, and CD a cardinal number. We can also represent this more simply, for example by labeling just the nouns.

Parsing
We have now tagged the sentence by its parts of speech, but even this doesn't help us much, partly because a date of birth has two main components: a date and a person who was born on that day. The POS tags alone don't give us this, so we have to take it a little further: by parsing the sentence we get a tree out of it. This gives us a little more context, because now we have a subject, a verb phrase, and a noun phrase, so we understand how the pieces of the sentence are related.

Semantic combinations
But this syntactic structure still doesn't give us much semantic information. What we need is a way to semantically tag this example. The idea is to get a parse tree that contains semantic information:

John Smith /name
was born /birth
on January 1st, 1948 /date

We can now define a rule: a birth date is composed of a person, a birth, and a date.
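The rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the method used in the presentation: the tagged tokens are copied from the POS example, while the names `MONTHS`, `semantic_label`, and `extract_birth_date` and the labeling rules themselves are hypothetical, and the standard library is used here in place of a toolkit such as NLTK.

```python
import re

# POS-tagged tokens for the example sentence, copied from the tagging step.
TAGGED = [
    ("John", "NNP"), ("Smith", "NNP"), ("was", "VBD"), ("born", "VBN"),
    ("on", "IN"), ("January", "NNP"), ("1st", "CD"), (",", ","),
    ("1948", "CD"), (".", "."),
]

# Hypothetical word list used to tell month names apart from personal names.
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def semantic_label(word, tag):
    """Assign a coarse semantic label to one tagged token (illustrative rules)."""
    if word in MONTHS:
        return "MONTH"
    if tag == "NNP":
        return "NAME"
    if word.lower() == "born":
        return "BORN"
    if tag == "CD" and re.fullmatch(r"\d{1,2}(st|nd|rd|th)?", word):
        return "DAY"
    if tag == "CD" and re.fullmatch(r"\d{4}", word):
        return "YEAR"
    return "O"  # token is not used by the rule


def extract_birth_date(tagged):
    """Apply the rule: birth date = person + birth + date."""
    labeled = [(word, semantic_label(word, tag)) for word, tag in tagged]
    # Drop tokens the rule ignores ("was", "on", punctuation).
    kept = [(word, label) for word, label in labeled if label != "O"]
    labels = " ".join(label for _, label in kept)
    # person = one or more NAME tokens, birth = BORN, date = MONTH DAY YEAR
    match = re.fullmatch(r"((?:NAME )+)BORN MONTH DAY YEAR", labels)
    if match is None:
        return None
    n_names = match.group(1).count("NAME")
    words = [word for word, _ in kept]
    return {
        "person": " ".join(words[:n_names]),
        "date": " ".join(words[n_names + 1:]),
    }


print(extract_birth_date(TAGGED))
```

Because the sketch simply discards every token it cannot label, it would also accept sentences with extra words between the person and the birth keyword; a real system would need stricter rules and a much larger vocabulary.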
Just like we did for syntactic parsing, we start from the bottom up, in a process called chunking, to arrive at the final level:

/person
    /name
        John /firstname
        Smith /lastname
/birth
    /was? born /born keyword
/date
    /on? January /month 1st /day 1948 /year

Tools for information extraction
Chunking
Natural Language Toolkit (NLTK) in Python

The more structured your source is to begin with, the easier it is to extract information from it.

SOURCES:
"Information Extraction." Wikipedia. Wikimedia Foundation, 27 Jan. 2013. Web. 14 Feb. 2013.
"The Stanford Parser: A Statistical Parser." The Stanford NLP (Natural Language Processing) Group. N.p., n.d. Web. 14 Feb. 2013. (http://nlp.stanford.edu/software/lex-parser.shtml)
"Natural Language Toolkit." Natural Language Toolkit - NLTK 2.0 Documentation. N.p., n.d. Web. 14 Feb. 2013. (nltk.org)
Khoshmood, Foaad, Dr. "CSC 580-Formal Grammars of English, Context-Free Grammars, Parsing." CSC 580 Lecture. Cal Poly, San Luis Obispo. Oct. 2012. Lecture.

Thanks to Ancestry.com for teaching me about IE.